ABSTRACT

We apply the multilayer bootstrap network (MBN), a recently
proposed unsupervised learning method, to unsupervised speaker
recognition. The proposed method first extracts supervectors
from an unsupervised universal background model, then reduces
the dimension of the high-dimensional supervectors with a
multilayer bootstrap network, and finally conducts unsupervised
speaker recognition by clustering the low-dimensional
data. Comparisons with two unsupervised and one supervised
speaker recognition techniques demonstrate the effectiveness
and robustness of the proposed method.

Index Terms — multilayer bootstrap network, speaker 
recognition, unsupervised learning. 

1. INTRODUCTION 

Speaker recognition aims to identify speakers from their
voices. It is important in many speech systems, such as
speaker diarization, language recognition, and speech recognition.
Supervised methods include maximum a posteriori
estimation [1,2], linear discriminant analysis (LDA) [3,4],
support vector machines [2], deep neural networks [5,6], etc.

Because constructing a manually labeled corpus is labor-intensive
and time-consuming, unsupervised speaker recognition methods
are strongly needed. Existing
methods mainly include principal component analysis (PCA),
k-means clustering, Gaussian mixture models (GMM),
agglomerative hierarchical clustering, and joint factor analysis.
For example, Wooters and Huijbregts [7] used agglomerative
clustering to merge speaker segments by the Bayesian information
criterion. Iso [8] used vector quantization to encode
speech segments and used spectral clustering, which is k-means
clustering applied to a low-dimensional subspace of the
data, for speaker recognition. Nwe et al. [9] used a group of
GMM clusterings to improve individual base GMM clusterings.
Some methods apply clustering techniques, e.g. variational
Bayesian expectation-maximization (EM) GMM [10]
and spectral clustering [11], to a low-dimensional total variability
subspace [4] that is learned from high-dimensional
supervectors by joint factor analysis [4]. Some methods compensate
the total variability space with new items, e.g. [12].


Because little prior knowledge of the data is available beforehand,
an unsupervised method should satisfy the following
conditions: (i) no need for manually labeled training data; (ii)
no hyperparameter tuning to reach a satisfactory performance; and
(iii) robustness to different data or modeling conditions. Due
to these strict requirements, unsupervised speaker recognition
is a very difficult task. In this paper, we present a multilayer
bootstrap network (MBN) [13] based algorithm. MBN is a recently
proposed unsupervised nonlinear dimensionality reduction
algorithm. Experimental results show that the proposed
method satisfies these requirements.

This paper is organized as follows. In Section 2, we 
present the MBN-based system. In Section 3, we present 
the MBN algorithm and its typical hyperparameter setting. 
In Section 4, we present the relationship between MBN and 
deep learning. In Section 5, we report comparison results. In 
Section 6, we conclude this paper. 

2. SYSTEM 

Given an unlabeled speaker recognition corpus, we propose 
the following unsupervised algorithm:*

• The first step trains a speaker- and session-independent
unsupervised universal background model (UBM)
[1] from an acoustic feature, which produces a d-dimensional
supervector for each utterance, denoted
as $\mathbf{x} = [\mathbf{n}^T, \mathbf{f}^T]^T$, where $\mathbf{n}$ is the accumulation of the
mixture occupation over all frames of the utterance and
$\mathbf{f}$ is the vector form of the centered first-order statistics.

• The second step reduces the dimension of $\mathbf{x}$ from $d$ to $\bar{d}$
($\bar{d} \ll d$) by the multilayer bootstrap network (MBN), which
is introduced in Section 3.

• The third step conducts k-means clustering on the
low-dimensional data if the number of the underlying speakers
is known, or agglomerative clustering if the number
of the speakers is unknown.
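
As an illustration of the first step, the following minimal sketch assembles the supervector of one utterance from frame-level mixture occupations. It assumes that the occupations have already been computed from the UBM and that the first-order statistics are centered on the UBM means; the function name `supervector` and its arguments are hypothetical and are not taken from the released source code.

```python
def supervector(frames, posteriors, ubm_means):
    """Build x = [n, f] for one utterance (illustrative sketch).

    frames:     list of T feature vectors, each of length D
    posteriors: list of T lists; posteriors[t][m] is the occupation of
                mixture m at frame t (assumed to be precomputed)
    ubm_means:  list of M UBM mean vectors, each of length D
    """
    num_mix = len(ubm_means)
    dim = len(ubm_means[0])
    n = [0.0] * num_mix                          # accumulated occupations
    f = [[0.0] * dim for _ in range(num_mix)]    # centered first-order stats
    for o_t, gamma_t in zip(frames, posteriors):
        for m in range(num_mix):
            n[m] += gamma_t[m]
            for j in range(dim):
                # Assumption: centering subtracts the UBM mean of mixture m.
                f[m][j] += gamma_t[m] * (o_t[j] - ubm_means[m][j])
    # Flatten f mixture by mixture and append it to n.
    return n + [value for f_m in f for value in f_m]
```

Under these assumptions, a UBM with M mixtures and a D-dimensional acoustic feature gives a supervector of M(D + 1) entries.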


*The source code is downloadable from http://sites.google.com/site/zhangxiaolei321/speaker_recognition

• Sparse representation learning. The fourth step assigns
the input $\mathbf{x}$ to one of the k clusters and outputs
a k-dimensional indicator vector $\mathbf{h} = [h_1, \ldots, h_k]^T$.
For example, if $\mathbf{x}$ is assigned to the second cluster, then
$\mathbf{h} = [0, 1, 0, \ldots, 0]^T$. The assignment is calculated according
to the similarities between $\mathbf{x}$ and the k centers,
in terms of some predefined similarity measurement at
the bottom layer, such as the minimum squared loss
$\arg\min_{i=1}^{k} \|\mathbf{w}_i - \mathbf{x}\|^2$, or in terms of $\arg\max_{i=1}^{k} \mathbf{w}_i^T \mathbf{x}$
at all other hidden layers [13].
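
The assignment in the fourth step can be sketched as follows. This is only an illustration of the two similarity measurements named above (minimum squared distance at the bottom layer, maximum inner product at the other hidden layers); the function name `assign_one_hot` and the `bottom_layer` flag are hypothetical.

```python
def assign_one_hot(x, centers, bottom_layer=True):
    """Return the k-dimensional indicator vector h for input x."""
    if bottom_layer:
        # Bottom layer: the closest center in squared Euclidean distance
        # wins, so the negative distance is used as the score.
        scores = [-sum((w_j - x_j) ** 2 for w_j, x_j in zip(w, x))
                  for w in centers]
    else:
        # Other hidden layers: the center with the largest inner product wins.
        scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in centers]
    winner = max(range(len(centers)), key=scores.__getitem__)
    h = [0] * len(centers)
    h[winner] = 1
    return h
```

For instance, with centers `[[0.0, 0.0], [0.9, 0.1], [5.0, 5.0]]` and input `[1.0, 0.0]`, the squared-distance rule selects the second center, while the inner-product rule selects the third.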


Fig. 1. The MBN network. Each square represents a k-centers
clustering.



Fig. 2. Random reconstruction step in MBN.


3. MULTILAYER BOOTSTRAP NETWORK 

The structure of MBN [13] is shown in Fig. 1. MBN is a
multilayer localized PCA algorithm that gradually enlarges
the area of a local region implicitly from the bottom hidden
layer to the top hidden layer by high-dimensional sparse coding,
and gets a low-dimensional feature explicitly by PCA at
the output layer.

Each hidden layer of MBN consists of a group of mutually
independent k-centers clusterings. Each k-centers clustering
has k output units, each of which indicates one cluster. The
output units of all clusterings are concatenated as the input of
their upper layer [13].

MBN is trained layer-by-layer from the bottom up. To train
a hidden layer given a d-dimensional input
$X = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, MBN trains each clustering independently
[13]:

• Random feature selection. The first step randomly selects
$\hat{d}$ dimensions of $X$ ($\hat{d} < d$) to form a new set
$\hat{X} = \{\hat{\mathbf{x}}_1, \ldots, \hat{\mathbf{x}}_n\}$. This step is controlled by a
hyperparameter $a = \hat{d}/d$.

• Random sampling. The second step randomly selects
k data points from $\hat{X}$ as the k centers of the clustering,
denoted as $\{\mathbf{w}_1, \ldots, \mathbf{w}_k\}$. This step is controlled by a
hyperparameter k.
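
The two steps above can be sketched as follows for a single clustering. The sketch assumes that both selections are made uniformly without replacement, which the text does not state explicitly; the function name `init_k_centers` and its arguments are hypothetical.

```python
import random


def init_k_centers(data, k, a=0.5, rng=None):
    """Random feature selection followed by random sampling of k centers."""
    rng = rng or random.Random()
    d = len(data[0])
    d_hat = max(1, int(a * d))                 # number of kept dimensions
    # Step 1: keep a random subset of the d input dimensions.
    selected = sorted(rng.sample(range(d), d_hat))
    reduced = [[x[j] for j in selected] for x in data]
    # Step 2: pick k of the reduced data points as the centers
    # (assumption: sampling without replacement).
    centers = [list(w) for w in rng.sample(reduced, k)]
    return selected, centers
```

Because every clustering draws its own dimensions and centers, the clusterings of one layer differ from each other even though they see the same input.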


3.1. A typical hyperparameter setting 

MBN has five hyperparameters $\{V, L, \{k_l\}_{l=1}^{L}, a, r\}$, where
V is the number of k-centers clusterings per layer, L is the
number of hidden layers, and $k_l$ is the hyperparameter k at
the l-th hidden layer. As shown in [13], MBN is robust to
hyperparameter selection. Here we introduce a typical setting:

• Setting hyperparameter k. (i) $k_1$ should be as large as
possible, i.e. $k_1 \to n$. Suppose the largest k supported
by hardware is $k_{\max}$; then $k_1 = \min(0.9n, k_{\max})$. (ii)
$k_l$ decays with a factor of, e.g., 0.5 as the layer index
increases, that is, $k_l = 0.5 k_{l-1}$. (iii) $k_L$
should be larger than the number of speakers c. Typically,
$k_L \approx 1.5c$. If c is unknown, we simply set $k_L$
to a relatively large number, e.g. 30, since c is unlikely
to be larger than 30 in a practical dialog.

• Setting hyperparameter r. When a problem is small-scale,
e.g. $k_1 > 0.8n$, then r = 0.5; otherwise, r = 0.

• Setting other hyperparameters. Hyperparameter V
should be larger than 100, typically V = 400.
Hyperparameter a is fixed to 0.5. Hyperparameter L is
determined by k.
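
The typical setting can be written down as a small helper. This is a sketch under stated assumptions: halving uses integer division, and layers are added as long as the next k stays at or above 1.5c, which is one possible reading of rule (iii). The function name `typical_mbn_setting` is hypothetical.

```python
def typical_mbn_setting(n, c, k_max=None):
    """Return a typical MBN setting for n data points and c speakers."""
    k1 = (9 * n) // 10                  # k_1 = 0.9 n, in integer arithmetic
    if k_max is not None:
        k1 = min(k1, k_max)             # respect the hardware limit
    ks = [k1]
    # Assumption: halve k per layer while the next value is still >= 1.5 c.
    while 2 * (ks[-1] // 2) >= 3 * c:
        ks.append(ks[-1] // 2)
    # Small-scale problem (k_1 > 0.8 n): r = 0.5, otherwise r = 0.
    r = 0.5 if 10 * k1 > 8 * n else 0.0
    return {"V": 400, "a": 0.5, "r": r, "k": ks}
```

With n = 3400 and c = 34, this helper yields k = 3060-1530-765-382-191-95 and r = 0.5, so L = 6 hidden layers follow from the schedule of k.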

4. RELATED WORK 

The proposed method learns multilayer nonlinear transforms,
which is related to deep learning (a.k.a. multilayer neural
networks), a recent advanced topic in many speech processing
fields, e.g. speaker recognition [5,6], speech recognition
[14], speech separation and enhancement [15-18],
speech synthesis [19], and voice activity detection [20,21].
The aforementioned deep learning methods are all supervised
and limited to neural networks, while the proposed
method is unsupervised and different from neural
networks.


5. EXPERIMENTS 


• Random reconstruction. The third step randomly selects
$d'$ of the $\hat{d}$ dimensions of the k centers ($d' \le \hat{d}/2$) and
does a one-step cyclic-shift as shown in Fig. 2. This
step is controlled by a hyperparameter $r = d'/\hat{d}$.
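
A possible reading of the random reconstruction step is sketched below. Since Fig. 2 is not reproduced here, the sketch assumes that the one-step cyclic-shift moves the values of each selected dimension from every center to the next center, with the last center wrapping around to the first; this interpretation and the function name `random_reconstruction` are assumptions.

```python
import random


def random_reconstruction(centers, r=0.5, rng=None):
    """Cyclically shift a random subset of dimensions across the centers."""
    rng = rng or random.Random()
    k = len(centers)
    num_dims = len(centers[0])
    d_prime = int(r * num_dims)                 # number of shifted dimensions
    dims = rng.sample(range(num_dims), d_prime)
    new_centers = [list(w) for w in centers]
    for j in dims:
        for i in range(k):
            # Center i takes dimension j from center i-1; index -1 wraps
            # around to the last center (assumed shift direction).
            new_centers[i][j] = centers[i - 1][j]
    return new_centers, sorted(dims)
```

With r = 0, no dimension is selected and the centers remain the sampled data points.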


5.1. Experimental setup 

We used the training corpus of the speech separation challenge
(SSC) [22]. The training corpus contains 34 speakers, each
of whom has 500 clean utterances. We selected the first 100
utterances (a.k.a. sessions) of each speaker for evaluation,
which amounts to 3400 utterances. We set the frame length
to 25 milliseconds and the frame shift to 10 milliseconds, and
extracted a 25-dimensional MFCC feature.

For the proposed MBN-based speaker recognition, we
adopted the typical parameter setting of MBN. Specifically,
V = 400, a = 0.5, r = 0.5, and k was set to
3060-1530-765-382-191-95. The output of PCA was set to
{2, 3, 5, 10, 30, 50} dimensions respectively. We assumed
that the number of speakers was known, and used k-means
clustering for clustering the low-dimensional data.

We compared with PCA, k-means clustering, and an
LDA-based system, where the first two methods are unsupervised
and the third one is supervised. For the PCA-based
method, we first used the same UBM as the MBN-based
method to extract high-dimensional supervectors, then reduced
the dimension of the supervectors to {2, 3, 5, 10, 30, 50}
respectively, and finally evaluated the low-dimensional output
of PCA by k-means clustering. For the k-means-clustering-based
method, we applied k-means clustering to the
high-dimensional supervectors directly.

The LDA-based system^ uses UBM to extract a high-dimensional
feature, then uses joint factor analysis to reduce
the high-dimensional feature to an intermediate low-dimensional
representation in an unsupervised way, and finally
uses LDA, a supervised dimensionality reduction method, to
reduce the intermediate representation to a low-dimensional
subspace where classification is conducted by a probabilistic
LDA algorithm. Since factor analysis is an unsupervised
dimensionality reduction method, we set its output to
{2, 3, 5, 10, 30, 50} dimensions respectively for comparison.
We constructed a training set from the SSC corpus for this
supervised method: each speaker has 100 training
utterances, which are selected from the 400 remaining
utterances of the speaker.

The performance was measured by normalized mutual
information (NMI) [23]. NMI was proposed to overcome the label
indexing problem between the ground-truth labels and the
predicted labels. It is one of the standard evaluation metrics of
unsupervised learning. The higher the NMI is, the better the
performance is. We also report the classification accuracy of
the LDA-based system in the Supplementary Material^ where
we can see that NMI is consistent with classification accuracy.
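
For reference, NMI can be computed from two label sequences as sketched below. Several normalizations of mutual information are in use; this sketch assumes normalization by the geometric mean of the two label entropies, which may differ from the exact variant used in [23]. The function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(true_labels, pred_labels):
    """Normalized mutual information between two labelings."""
    n = len(true_labels)
    count_true = Counter(true_labels)
    count_pred = Counter(pred_labels)
    count_joint = Counter(zip(true_labels, pred_labels))
    # Mutual information from the joint and marginal label frequencies.
    mutual_info = 0.0
    for (u, v), n_uv in count_joint.items():
        mutual_info += (n_uv / n) * math.log(
            n * n_uv / (count_true[u] * count_pred[v]))
    h_true = -sum((m / n) * math.log(m / n) for m in count_true.values())
    h_pred = -sum((m / n) * math.log(m / n) for m in count_pred.values())
    # Assumption: normalize by sqrt(H(true) * H(pred)); return 0 when a
    # labeling has a single cluster and the entropy product vanishes.
    denom = math.sqrt(h_true * h_pred)
    return mutual_info / denom if denom > 0 else 0.0
```

Because only the co-occurrence counts of the labels enter the computation, the score does not change when the cluster indices are permuted, which is how NMI avoids the label indexing problem.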

5.2. Results 

Because all comparison methods use UBM to extract speaker-
and session-independent supervectors, we need to study how
they behave in different UBM settings, in terms of mixture
number and expectation-maximization (EM) iterations. (i)
The mixture number of UBM reflects the capacity of UBM
for modelling an underlying data distribution: if the mixture
number of UBM is smaller than the number of speakers,
UBM is likely underfitting, i.e. it cannot grasp the data
distribution well. To study this effect, we set the mixture number of
UBM to {1, 2, 4, 8, 16, 32, 64} respectively. (ii) The number
of EM iterations of UBM reflects the quality of the acoustic
feature produced by UBM: if the EM optimization is not
sufficient, the acoustic feature is noisy. To study this effect, we set
the number of EM iterations of UBM to {0, 20} respectively,
where setting the number of iterations to 0 means that UBM
is initialized with randomly sampled means without EM
optimization, which is the worst case.

^The source code is downloadable from http://research.microsoft.com/en-us/downloads/a6262fec-03a7-4060-a08c-0b0d037a3f5b/

^http://sites.google.com/site/zhangxiaolei321/speaker_recognition

Fig. 3. Visualizations of 10 speakers by PCA and MBN
respectively, where a 16-mixture UBM with 20 EM iterations
is used to produce their input supervectors. The speakers are
labeled in different colors.

Fig. 3 and Supplementary-Fig. 1 give a comparison
example between PCA and MBN in visualizing the first 10
speakers, where 16-mixture UBMs with 20 and 0 EM
iterations are used to generate their inputs respectively. From the
figures, we can see that MBN produces ideal visualizations.

Fig. 4 reports results with respect to the mixture number
of UBM. Fig. 5 reports results with respect to the number
of output dimensions. Supplementary-Tables 1 and 3 report
the detailed results of the two figures. From the figures and
tables, we observe the following phenomena: (i) the MBN-based
method outperforms the PCA- and k-means-clustering-based
methods and approaches the supervised LDA system
in all cases; (ii) the MBN-based method is less sensitive to
different parameter settings of both UBM and MBN itself; (iii)
the LDA-based system is less sensitive to the mixture number
of UBM, but sensitive to the number of output dimensions;
(iv) the PCA-based method is sensitive to both the mixture
number of UBM and the number of output dimensions, and
strongly relies on the effectiveness of UBM; (v) the
performance of the k-means-clustering-based method is consistent
with that of the PCA-based method.

Fig. 6 reports results of the MBN-based method with respect
to the number of hidden layers. From the figure, we observe
that the accuracy improves gradually with the increase
of the number of hidden layers.

Fig. 6. Accuracy (in terms of NMI) of the MBN-based method with respect to the number of hidden layers.

Fig. 4. Accuracy comparison (in terms of NMI) between
LDA-, k-means clustering-, PCA-, and MBN-based speaker
recognition methods with respect to the mixture number of
UBM. (a) Comparison when the EM iteration number of
UBM is set to 20. (b) Comparison when the EM iteration
number of UBM is set to 0. Note that given a mixture number
of UBM, the accuracy of a method is the best result among
the results produced from 6 candidate output dimensions of
the method, except k-means clustering.

6. CONCLUSIONS 

In this paper, we have proposed a multilayer bootstrap network
based unsupervised speaker recognition algorithm. The
method first uses UBM to extract a high-dimensional feature
from the original MFCC acoustic feature, then uses MBN
to reduce the high-dimensional feature to a low-dimensional
space, and finally clusters the low-dimensional data. We
have compared it with the PCA-, k-means-clustering-, and
LDA-based methods, where the first two methods are
unsupervised and the third method is supervised. Experimental
results have shown that the proposed method outperforms
the unsupervised methods and approaches the supervised
method. Moreover, it is insensitive to different parameter
settings of UBM and MBN, which facilitates its practical use.

(a) UBM with 20 EM iterations (b) UBM with 0 EM iterations

Fig. 5. Accuracy comparison (in terms of NMI) between
LDA-, PCA-, and MBN-based speaker recognition methods
with respect to the number of output dimensions. (a)
Comparison when the EM iteration number of UBM is set to 20.
(b) Comparison when the EM iteration number of UBM is
set to 0. Note that given a number of output dimensions, the
accuracy of a method is the best result among the results
produced from 7 candidate UBMs.